3.8.1 What is the problem with this plot? How could you improve it?

When running summary() on the dataset, we see there are 234 items in the mpg dataset. You can not see all of the data on the plot since there is a lot of overlap. To solve this we can add jitter and also some color coding and a legend to make the datapoints more distinguishable by model or some other attribute.

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.4.3
## -- Attaching packages ------------------------------------------------------ tidyverse 1.2.1 --
## v ggplot2 2.2.1     v purrr   0.2.4
## v tibble  1.4.2     v dplyr   0.7.4
## v tidyr   0.7.2     v stringr 1.2.0
## v readr   1.1.1     v forcats 0.2.0
## Warning: package 'tibble' was built under R version 3.4.3
## Warning: package 'tidyr' was built under R version 3.4.3
## Warning: package 'purrr' was built under R version 3.4.3
## Warning: package 'dplyr' was built under R version 3.4.3
## -- Conflicts --------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(nycflights13)
## Warning: package 'nycflights13' was built under R version 3.4.3
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + 
  geom_point()

3.8.2 What parameters to geom_jitter() control the amount of jittering?

Width and height parameters control the amount of horizontal and vertical jittering.

3.8.3 Compare and contrast geom_jitter() with geom_count().

Geom_count makes a point more dense depending on the number of datapoints that are at the overlapping x-y coordinate. Jitter makes the points more distinct by adding a small amount vertical or horizontal offset.

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + 
  geom_count()

3.8.4 What’s the default position adjustment for geom_boxplot()? Create a visualisation of the mpg dataset that demonstrates it.

The default position is “dodge”.

p <- ggplot(mpg, aes(class, hwy))
p + geom_boxplot()

3.9.2 What does labs() do?

The labs() function is used to modify axis, legend, And plot Labels.

3.9.4 What does the plot below tell you about the relationship between city and highway mpg? Why is coord_fixed() important? What does geom_abline() do?

The plot below shows that greater city miles per gallong is linearly related to greater highway miles per gallon. The coord_fixed() is important for keeping a fixed aspect ratio for the plot since both x and y refer to miles per gallon measures. The geom_abline serves as a reference line for annotating the plot and further illustrates the relationship.

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() + 
  geom_abline() +
  coord_fixed()

4.4.1 Why does this code not work?

There is a typo in the name of the variable when trying to display it.

my_variable <- 10
my_varıable
## [1] 10
#> Error in eval(expr, envir, enclos): object 'my_varıable' not found

4.4.2 Tweak each of the following R commands so that they run correctly:

library(tidyverse)

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))

filter(mpg, cyl == 8)
filter(diamonds, carat > 3)

eee

5.2.4.1 Find all flights that

Had an arrival delay of two or more hours

filter(flights, arr_delay >= 2)

Flew to Houston (IAH or HOU)

filter(flights, dest == 'IAH' | dest == 'HOU')

Were operated by United, American, or Delta

filter(flights, carrier == 'UA' | carrier == 'AA' | carrier == 'DL')

Departed in summer (July, August, and September)

filter(flights, month == 7 | month == 8 | month == 9)

5.2.4.2 Another useful dplyr filtering helper is between(). What does it do? Can you use it to simplify the code needed to answer the previous challenges?

flights %>%
 filter(between(dep_time, 0, 600))

5.2.4.3 How many flights have a missing dep_time? What other variables are missing? What might these rows represent?

8,255 rows have a missing dep_time. The arrival time is also NA. These may be cancelled flights.

flights %>% filter(is.na(dep_time))

5.2.4.4 Why is NA ^ 0 not missing? Why is NA | TRUE not missing? Why is FALSE & NA not missing? Can you figure out the general rule? (NA * 0 is a tricky counterexample!)

Anything ORed with true is still true. Anything ANDed with false is false and not NA.

5.4.1.1 Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.

select(flights, dep_time, dep_delay, arr_time, arr_delay)

5.4.1.2 What happens if you include the name of a variable multiple times in a select() call?

It only shows the variable once.

select(flights, dep_time, dep_time)

5.4.1.3 What does the one_of() function do? Why might it be helpful in conjunction with this vector?

The one_of function lets you go through variables in a dataframe by name and select if one of the names exist.

vars <- c("year", "month", "day", "dep_delay", "arr_delay")

5.4.1.4 Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?

It selects based on the column’s name containing TIME rather than the value containing time. It is not case sensitive. You could use explicity conversions to lower case first to force case.

select(flights, contains("TIME"))